Abstract: This paper describes a new set of block source codes
well suited for data compression. These codes are defined by sets
of production rules of the form $a\,l \to b$, where $a \in \mathcal{A}$
represents a value from the source alphabet $\mathcal{A}$ and $l$, $b$
are short sequences of bits. These codes naturally encompass other
Variable Length Codes (VLCs) such as Huffman codes. It is shown that
these codes may have a similar or even a shorter mean description
length than Huffman codes for the same encoding and decoding
complexity. A first code design method that preserves the
lexicographic order in the bit domain is described. The
corresponding codes have the same mean description length (mdl)
as the Huffman codes from which they are constructed. Therefore,
from a compression point of view, they outperform the Hu-Tucker
codes designed to offer the lexicographic property in the bit
domain. A second construction method yields codes such that the
marginal bit probability converges to 0.5 as the sequence length
increases; this is achieved even if the probability distribution
function is not known by the encoder.

I. Introduction 

Grammars are powerful tools which are widely used in
computer science. Most lossless compression algorithms
can actually be formalized with grammars. Codes explicitly
based on grammars have been considered as a means for data
compression [1]. These codes losslessly encode a sequence
in two steps. A first analysis step consists in finding the
production rules. A second step applies these rules to the
sequence to be encoded. These codes have mainly been
compared with dictionary-based compression algorithms such
as LZ77 [2] or [3], which also implicitly use the grammar
formalism. All these codes have in common the fact that the
set of production rules depends on the data to be encoded, and
not only on the source properties.

In this paper, a new set of codes based on specific production
rules is introduced. In contrast with LZ77-like algorithms
or grammar codes, the set of production rules is fixed. In
contrast with the grammar codes introduced so far in the
literature, the codes described here encompass Huffman codes [4]
(but not Variable-to-Fixed Length codes such as Tunstall codes [5]).
The form of the production rules is presented in Section II.
The sequence of bits generated by a given production rule
may be re-written by a subsequent production rule. These codes
have the same encoding and decoding complexity as Huffman
codes. A possible drawback is that they require backward
encoding. However, since most applications deal with block
encoding, the forward encoding property is not absolutely
required. In Section III, the decoding and encoding procedures
with automata are described. The compression efficiency of these
codes is analyzed in Section IV. It is shown in an example that
the proposed codes allow for better compression efficiency than
Huffman codes.

Two code construction methods are then described. The
first method constructs a set of production rules preserving
the lexicographic order of the original source sequence in the
bit domain. This property is of interest for database
applications, since it allows comparative queries to be processed
directly in the bit domain, hence avoiding the premature
decoding of the compressed dictionary for the query itself.
Note that the lexicographic VLC of minimal mdl is usually
obtained with the Hu-Tucker algorithm [6]. This algorithm
is optimal within the set of lexicographic VLCs. For some
sources, the Hu-Tucker codes may have the same compression
efficiency as Huffman codes, but this is not the case in general.
The method proposed in Section V constructs lexicographic codes
with the same compression performance as Huffman codes and
that allow for symbol-by-symbol encoding and decoding
procedures. Obtaining together the properties of lexicographic
order preservation and high compression efficiency illustrates
the interest of codes based on the proposed set of production
rules.

The second construction method, described in Section VI,
yields codes, for stationary sources, such that the
marginal bit probability is asymptotically equal to 0.5. The
main advantage of these codes is that this probability is equal
to 0.5 even if the actual source probabilities are not known at
the encoder, or if the assumed a priori probabilities differ from
the true probabilities. Since channel encoders widely assume that
0s and 1s have the same probability, this property is of interest
when compressed bitstreams protected by such encoders are
transmitted over noisy channels.

II. Problem statement and Notations 

In the sequel, random variables are denoted by upper cases
and the corresponding realizations are denoted by lower cases.
Sets are denoted by calligraphic characters. The cardinality of
a given set $\mathcal{X}$ is denoted $|\mathcal{X}|$. We define
$\mathcal{X}^+ = \bigcup_{i=1}^{\infty} \mathcal{X}^i$ and
$\mathcal{X}^* = \{\varepsilon\} \cup \mathcal{X}^+$, where
$\varepsilon$ denotes the void sequence. Hence $\mathcal{X}^*$
denotes the set of sequences composed of elements of $\mathcal{X}$.
Let $S \in \mathcal{A}^+$ be a sequence of source symbols taking
their values in a finite alphabet
$\mathcal{A} = \{a_1, \ldots, a_i, \ldots\}$. The length of
such a sequence is denoted $L(S)$. The alphabet $\mathcal{A}$ is
assumed to be ordered according to a total order $\prec$. Without
loss of generality, we assume that
$a_1 \prec a_2 \prec \cdots \prec a_i \prec \cdots \prec a_{|\mathcal{A}|}$.
Let us define $\mathcal{B} = \{0, 1\}$. In the sequel, the emitted
bitstream is denoted $E = E_1 \ldots E_{L(E)} \in \mathcal{B}^*$ and
its realization is denoted $e = e_1 \ldots e_{L(e)}$.

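Definition 1: A Variable Length Re-writing System (VLRS)
is a set $\mathcal{R} = \bigcup_{i \in [1..|\mathcal{A}|]} \mathcal{R}_i$,
where $\mathcal{R}_i$ denotes the set of rules related to a given
symbol $a_i$, defined as

    $r_{1,1} : a_1\, l_{1,1} \to b_{1,1} = b_{1,1}^1 \ldots b_{1,1}^{L(b_{1,1})}$
    $\vdots$
    $r_{i,j} : a_i\, l_{i,j} \to b_{i,j} = b_{i,j}^1 \ldots b_{i,j}^{L(b_{i,j})}$
    $\vdots$
    $r_{|\mathcal{A}|,|\mathcal{R}_{|\mathcal{A}|}|} : a_{|\mathcal{A}|}\, l_{|\mathcal{A}|,|\mathcal{R}_{|\mathcal{A}|}|} \to b_{|\mathcal{A}|,|\mathcal{R}_{|\mathcal{A}|}|}$

where $l_{i,j} \in \mathcal{B}^*$, $b_{i,j} \in \mathcal{B}^+$. This set is such that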

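1) $\forall i$, $|\mathcal{R}_i| \geq 1$,

2) The set $\bigcup_{i=1}^{|\mathcal{A}|} \bigcup_{j=1}^{|\mathcal{R}_i|} \{b_{i,j}\}$
forms a prefix code (i.e. no codeword is the prefix of another [7]).

3) $\forall i$, $\bigcup_{j=1}^{|\mathcal{R}_i|} \{l_{i,j}\}$ is the set
$\{\varepsilon\}$ or forms a full prefix code (i.e., such that the
Kraft sum is equal to 1).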

4) $\forall i, \forall i' \neq i$, $b_{i,j} = l_{i',j'}$ or $b_{i,j}$ is
not a prefix of $l_{i',j'}$.

These production rules transform a sequence $s$ of
symbols into a sequence $e$ of bits by successive applications
of production rules. These rules are assumed to be reversible:
inverting the direction of the arrow recovers a given
sequence $s$ from the corresponding bitstream $e$. Note that a
given production rule absorbs a symbol ($a_i$) and some bits
($l_{i,j}$) from the temporary term to be encoded, and generates a
given sequence of bits ($b_{i,j}$). Huffman codes are covered by
this definition. More generally, a VLRS is a Fixed-to-Variable
(F-to-V) Length code if $\forall i$, $|\mathcal{R}_i| = 1$ and
$l_{i,1} = \varepsilon$.

Example 1: Code $\mathcal{C}_1 = \{0, 10, 11\}$ can be seen as the
following VLRS:

    $r_{1,1} : a_1 \to 0$
    $r_{2,1} : a_2 \to 10$
    $r_{3,1} : a_3 \to 11$

Note that Definition 1 does not guarantee that such a system
leads to a valid prefix code. For example, a rule $r_{i,j}$ where
$b_{i,j}$ is a prefix of $l_{i,j}$ is not valid. In this paper, we
focus on VLRS leading to valid codes. Note that the
suffix-constrained codes introduced in [8] form a subset of VLRS
and are characterized as follows.

Definition 2: A suffix-constrained code is a VLRS such that
$\forall i, j$, $l_{i,j}$ is a suffix of $b_{i,j}$.

Example 2: The following VLRS $\mathcal{C}_2$ is a suffix-constrained
code:

    $r_{1,1} : a_1\,0 \to 10$
    $r_{1,2} : a_1\,1 \to 01$
    $r_{2,1} : a_2 \to 00$
    $r_{3,1} : a_3 \to 11$

Fig. 1. Examples of VLRS: a VLC $\mathcal{C}_1$ and a suffix-constrained
code $\mathcal{C}_2$. On the right, the transitions triggered by the
production rules are depicted by arrows.

Note that code $\mathcal{C}_2$ cannot be encoded in the forward
direction. We will return to this point in Section III. The
two following codes will also be considered in the sequel.
Note that these codes are not suffix-constrained codes.

Example 3: $\mathcal{C}_3$ is defined as

    $r_{1,1} : a_1 \to 00$
    $r_{2,1} : a_2\,0 \to 01$
    $r_{2,2} : a_2\,1 \to 10$
    $r_{3,1} : a_3 \to 11$



Example 4: $\mathcal{C}_4$ is defined as

    $r_{1,1} : a_1\,1 \to 0$
    $r_{1,2} : a_1\,0 \to 10$
    $r_{2,1} : a_2 \to 110$
    $r_{3,1} : a_3 \to 111$



VLRS can also be represented using trees, as depicted in
Fig. 1. The tree structure corresponds to the one of the prefix
code defined by $\bigcup_{i} \bigcup_{j} \{b_{i,j}\}$. Leaves
correspond to both the symbol $a_i$ and the sequence of bits
$l_{i,j}$.
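The rule sets of Examples 1 to 4 can be written down and checked
mechanically. The following Python sketch is illustrative only: the
dictionary layout `{(symbol, l): b}` and the helper names
`is_prefix_free` and `check_vlrs` are assumptions made for this
example, not notation from the paper. It tests conditions 1 to 3 of
Definition 1 and, as a simplification, does not test condition 4.

```python
from fractions import Fraction

# Rule sets of Examples 1-4, stored as {(symbol, l): b}.
# This layout is an assumption made for illustration.
C1 = {("a1", ""): "0", ("a2", ""): "10", ("a3", ""): "11"}
C2 = {("a1", "0"): "10", ("a1", "1"): "01", ("a2", ""): "00", ("a3", ""): "11"}
C3 = {("a1", ""): "00", ("a2", "0"): "01", ("a2", "1"): "10", ("a3", ""): "11"}
C4 = {("a1", "1"): "0", ("a1", "0"): "10", ("a2", ""): "110", ("a3", ""): "111"}


def is_prefix_free(words):
    """True if no word is a prefix of (or equal to) another word of the list."""
    return not any(
        i != j and v.startswith(u)
        for i, u in enumerate(words)
        for j, v in enumerate(words)
    )


def check_vlrs(rules, alphabet):
    """Check conditions 1-3 of Definition 1 (condition 4 is not tested)."""
    # Condition 2: the generated bit sequences b form a prefix code.
    if not is_prefix_free(list(rules.values())):
        return False
    for a in alphabet:
        lefts = [l for (sym, l) in rules if sym == a]
        # Condition 1: every symbol has at least one rule.
        if not lefts:
            return False
        # Condition 3: the l's are {epsilon} or form a full prefix code.
        if lefts == [""]:
            continue
        kraft = sum(Fraction(1, 2 ** len(l)) for l in lefts)
        if kraft != 1 or not is_prefix_free(lefts):
            return False
    return True


alphabet = ["a1", "a2", "a3"]
results = [check_vlrs(code, alphabet) for code in (C1, C2, C3, C4)]
print(results)
```

Exact fractions are used for the Kraft sum so that the equality test
in condition 3 is not affected by floating-point rounding.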



III. Encoding and Decoding with automata 

On the encoder side, the purpose of production rules is to
transform the sequence $s$ into the sequence $e$ of bits. Any
segment of the current sequence (composed of symbols and
bits, initialized by $s$) can be rewritten if there exists a rule
having this segment as an input (this input is composed of one
symbol and a variable number of bits). When no production
rule applies any more, the sequence contains only bits. The set
of rules defining a VLRS does not generally allow the
sequence $S$ to be encoded in the forward direction. Therefore,
the encoding must be processed backward. To initiate the encoding
process, a specific rule must be used to encode the last symbol of
the sequence. Indeed, the last symbol may not be sufficient
to trigger a production rule by itself. In most cases, the
missing termination bit(s) can be arbitrarily set to 0,
provided that the termination bit(s) do(es) not trigger a
production rule. Hence, this choice is valid for the codes
$\mathcal{C}_1$, $\mathcal{C}_2$ and $\mathcal{C}_3$ but should not be
used for code $\mathcal{C}_4$, since 0 triggers the rule $r_{1,1}$.

Example 5: Let $s_1 = a_1 a_2 a_2 a_3 a_2 a_1 a_1 a_1$ be a sequence of
symbols taking their values in the alphabet
$\mathcal{A}_1 = \{a_1, a_2, a_3\}$. This sequence is encoded with
code $\mathcal{C}_2$. Since the last symbol is $a_1$, no rule applies
directly. Therefore, the termination bit 0 is concatenated to this
sequence in order to initiate the encoding. The encoding then
proceeds as follows, each line showing the rule about to be applied
and the current sequence:

    $r_{1,1}$ : $s_1 0 = a_1 a_2 a_2 a_3 a_2 a_1 a_1 a_1\,0$
    $r_{1,2}$ : $a_1 a_2 a_2 a_3 a_2 a_1 a_1\,10$
    $r_{1,1}$ : $a_1 a_2 a_2 a_3 a_2 a_1\,010$
    $r_{2,1}$ : $a_1 a_2 a_2 a_3 a_2\,1010$
    $r_{3,1}$ : $a_1 a_2 a_2 a_3\,001010$
    $r_{2,1}$ : $a_1 a_2 a_2\,11001010$
    $r_{2,1}$ : $a_1 a_2\,0011001010$
    $r_{1,1}$ : $a_1\,000011001010$
    $e_1 = 1000011001010$



In [8], it was shown that transmitting the termination bit
is not required for suffix-constrained codes, as illustrated in
Example 5. This is due to the fact that a bit generated
by a production rule of a suffix-constrained code will not
be modified by a subsequent production rule. Since these
termination bits may be required in the general case, it will be
assumed that they are known at the decoder. In the following
example, the termination bit must be 1. Note that the sequence
is encoded with less than 1 bit per symbol.

Example 6: Let us now consider the sequence
$s'_1 = a_1 a_1 a_1 a_1 a_1$. This sequence is encoded with code
$\mathcal{C}_4$ as

    $r_{1,1}$ : $s'_1 1 = a_1 a_1 a_1 a_1 a_1\,1$
    $r_{1,2}$ : $a_1 a_1 a_1 a_1\,0$
    $r_{1,1}$ : $a_1 a_1 a_1\,10$
    $r_{1,2}$ : $a_1 a_1\,00$
    $r_{1,1}$ : $a_1\,100$
    $e'_1 = 000$
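The backward rewriting of Examples 5 and 6 can be reproduced with a
few lines of Python. This is a minimal sketch under stated
assumptions: rules are stored in a dictionary `{(symbol, l): b}`, and
the function name `encode` and its `termination` argument are
illustrative choices rather than part of the paper.

```python
# Codes C2 and C4 from Examples 2 and 4, stored as {(symbol, l): b}.
C2 = {("a1", "0"): "10", ("a1", "1"): "01", ("a2", ""): "00", ("a3", ""): "11"}
C4 = {("a1", "1"): "0", ("a1", "0"): "10", ("a2", ""): "110", ("a3", ""): "111"}


def encode(symbols, rules, termination=""):
    """Encode backward: start from the termination bits and rewrite
    one symbol at a time, from the last symbol to the first."""
    bits = termination
    for sym in reversed(symbols):
        for (a, l), b in rules.items():
            # A rule applies when its symbol matches and its left part l
            # is a prefix of the bits located right after the symbol.
            if a == sym and bits.startswith(l):
                bits = b + bits[len(l):]
                break
        else:
            raise ValueError("no rule applies to symbol %r" % sym)
    return bits


s1 = ["a1", "a2", "a2", "a3", "a2", "a1", "a1", "a1"]
e1 = encode(s1, C2, termination="0")
print(e1)  # 1000011001010, as in Example 5

e1_prime = encode(["a1"] * 5, C4, termination="1")
print(e1_prime)  # 000, as in Example 6
```

Because the left parts of the rules of a given symbol form a full
prefix code (or reduce to the void sequence), at most one rule can
match at each step, so the order in which the rules are scanned does
not matter.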



On the decoder side, the decoding is processed forward
using reverse rules. The encoding and decoding algorithms
are implemented using automata. These automata are used to
capture the memory of the encoding and decoding processes.
This memory corresponds to a segment of bits that may be
useful for the next production rule. Hence, the automata are
obtained directly from the set of production rules. The
transitions on the automaton representing the encoding process
are triggered by symbols. The internal states of the automaton
are given by the variable length segments of bits $\{l_{i,j}\}$.
This automaton may be reduced if a variable length bit segment
$l_{i,j}$ is a prefix of another segment $l_{i',j'}$ (in that case,
according to Definition 1, we have $i \neq i'$). If $\forall i,j$,
$l_{i,j} = \varepsilon$, there is only one internal state
$\{\varepsilon\}$, as for the encoding automaton corresponding to
code $\mathcal{C}_1$. The sets of states of the encoding automata of
codes $\mathcal{C}_2$, $\mathcal{C}_3$ and $\mathcal{C}_4$ are identical
and are equal to $\{0, 1\}$.

Fig. 2. Decoding automata corresponding to the codes $\mathcal{C}_1$,
$\mathcal{C}_2$, $\mathcal{C}_3$ and $\mathcal{C}_4$ and corresponding
decoding trellises. Transitions corresponding to 0s and 1s
are respectively plotted with dotted and solid lines.

The states of the decoding automata correspond to bit 
segments that have already been decoded, but which are not 
sufficient to identify a symbol. For VLCs such as Huffman 
codes, these internal states correspond to the internal nodes of 
the decoding codetree. 

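Example 7: The sets of internal states of the decoding automata of
codes $\mathcal{C}_1$, $\mathcal{C}_2$, $\mathcal{C}_3$ and
$\mathcal{C}_4$ are respectively $\{\varepsilon, 1\}$,
$\{\varepsilon, 0, 1\}$, $\{\varepsilon, 0, 1\}$ and
$\{\varepsilon, 1, 11\}$.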

The graphical representations of the decoding automata may
be deduced from the tree representations given in Fig. 1.
These automata are depicted in Fig. 2. The decoding trellises
corresponding to these automata are depicted on the right. For
the sake of clarity, the symbols generated by the bit transitions
are not shown. However, note that the set of generated symbol(s)
must also be associated with each bit transition. For the codes
$\mathcal{C}_1$, $\mathcal{C}_2$ and $\mathcal{C}_3$, at most 1 symbol
is associated with each bit transition. This is not the case for
code $\mathcal{C}_4$, where the transition starting from decoding
state 1 and triggered by the bit 0 generates the symbol $a_1$
twice. As shown in Example 6 and demonstrated in Section IV, this
transition allows long sequences of $a_1$ to be encoded with less
than 1 bit per symbol, at the price of a higher encoding
cost for the symbols $a_2$ and $a_3$.
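Forward decoding with reverse rules can be sketched as follows. The
sketch does not build the automata of Fig. 2 explicitly: it keeps
the pending bits in a string, which plays the role of the decoder
memory. The function name `decode` and the choice of passing the
number of symbols `n_symbols` to the decoder are assumptions made for
this example (the text only assumes that the termination bits are
known at the decoder).

```python
# Codes C2 and C4 from Examples 2 and 4, stored as {(symbol, l): b}.
C2 = {("a1", "0"): "10", ("a1", "1"): "01", ("a2", ""): "00", ("a3", ""): "11"}
C4 = {("a1", "1"): "0", ("a1", "0"): "10", ("a2", ""): "110", ("a3", ""): "111"}


def decode(bits, rules, n_symbols):
    """Decode forward by applying reverse rules b -> a l.
    Returns the decoded symbols and the bits left over at the end."""
    reverse = {b: (a, l) for (a, l), b in rules.items()}
    symbols = []
    for _ in range(n_symbols):
        # The b's form a prefix code, so the first match is the only one.
        for k in range(1, len(bits) + 1):
            if bits[:k] in reverse:
                a, l = reverse[bits[:k]]
                symbols.append(a)
                # The left part l is put back in front of the stream.
                bits = l + bits[k:]
                break
        else:
            raise ValueError("bitstream does not match any rule")
    return symbols, bits


decoded_5, leftover_5 = decode("1000011001010", C2, 8)
print(decoded_5, leftover_5)

decoded_6, leftover_6 = decode("000", C4, 5)
print(decoded_6, leftover_6)
```

On the bitstreams of Examples 5 and 6, the leftover bits are the
termination bits used by the encoder (0 and 1 respectively), which
gives a simple consistency check at the end of a block.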

IV. Compression efficiency 

In this section, we analyse the compression efficiency of
VLRSs. Let us assume that $S$ is a memoryless source
characterized by its stationary probability distribution function
(pdf) on $\mathcal{A}$: $\mu = \{P(a_1), \ldots, P(a_i), \ldots\}$. Let

    $\delta(r_{i,j}) = L(b_{i,j}) - L(l_{i,j})$    (1)

denote the number of bits generated by a given production rule
$r_{i,j}$. Note that for the particular case where
$\forall i, \forall j, j'$, $\delta(r_{i,j}) = \delta(r_{i,j'})$, the
mdl is equal to $\sum_{a_i \in \mathcal{A}} P(a_i)\,\delta(r_{i,1})$.

Example 8: Let us assume that $S$ is a memoryless source of
pdf $\mu_1 = \{0.7, 0.2, 0.1\}$. The entropy of this source is 1.157.
The mdl of code $\mathcal{C}_1$ is equal to 1.3. For the code
$\mathcal{C}_2$, we have $\delta(r_{1,1}) = \delta(r_{1,2}) = 1$ and
$\delta(r_{2,1}) = \delta(r_{3,1}) = 2$. The mdl of
this code is also equal to 1.3.
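The figures of Example 8 can be checked numerically. The sketch below
is only an illustration: the dictionary layout and the helper name
`mdl_equal_delta` are assumptions, and the function covers only the
particular case mentioned above, where all the rules of a symbol
generate the same number of bits.

```python
from math import log2

pdf = {"a1": 0.7, "a2": 0.2, "a3": 0.1}  # source pdf of Example 8

# Codes C1 and C2, stored as {(symbol, l): b}.
C1 = {("a1", ""): "0", ("a2", ""): "10", ("a3", ""): "11"}
C2 = {("a1", "0"): "10", ("a1", "1"): "01", ("a2", ""): "00", ("a3", ""): "11"}


def mdl_equal_delta(rules, pdf):
    """mdl when delta(r_ij) = L(b_ij) - L(l_ij) is the same for all
    the rules of a given symbol."""
    delta_of = {}
    for (a, l), b in rules.items():
        d = len(b) - len(l)  # Eq. (1)
        if delta_of.setdefault(a, d) != d:
            raise ValueError("rules of %r generate different bit counts" % a)
    return sum(pdf[a] * d for a, d in delta_of.items())


entropy = -sum(p * log2(p) for p in pdf.values())
mdl_c1 = mdl_equal_delta(C1, pdf)
mdl_c2 = mdl_equal_delta(C2, pdf)
print(round(entropy, 3), round(mdl_c1, 3), round(mdl_c2, 3))
```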

Let $R_t : S_t L_t \to B_t$ denote the rule to be used in order
to encode a given symbol $S_t$. Since the encoder proceeds
backward and since the source $S$ is memoryless, the process
$(Z_{t'}) = (R_{L(S)}, \ldots, R_t, \ldots, R_1)$ obtained from the
process $(R_t)$ by reversing the symbol clock $t$, i.e.
$(Z_{t'})_{t'=1,\ldots,L(S)} = (R_{L(S)-t'+1})_{t'=1,\ldots,L(S)}$,
forms an invariant Markov chain. In other words, we have
$P(Z_{t'} | Z_1, \ldots, Z_{t'-1}) = P(Z_{t'} | Z_{t'-1}) =
P(R_t | R_{t+1}, \ldots, R_{L(S)}) = P(R_t | R_{t+1}) =
P(R_{L(S)-1} | R_{L(S)})$.
If $S_t = a_i$, the rule $r_{i,j}$ is triggered if and only if the
realization of $L_t$ is a prefix of the bits $B_{t+1}$ generated by
the previous production rule. As a consequence, the probability
$P(R_t | R_{t+1})$ can be deduced from the source pdf as

$$
\begin{aligned}
P(R_t = r_{i,j} \mid R_{t+1} = r_{i',j'})
  &= P(R_t = r_{i,j} \mid B_{t+1} = b_{i',j'}) \\
  &= P(S_t = a_i,\, L_t = l_{i,j} \mid B_{t+1} = b_{i',j'}) \\
  &= \begin{cases} P(a_i) & \text{if } l_{i,j} \text{ is a prefix of } b_{i',j'}, \\ 0 & \text{otherwise.} \end{cases}
\end{aligned}
\qquad (2)
$$

Assuming that $(Z_{t'})$ is irreducible and aperiodic, the
marginal probability distribution $P(Z_{t'} = r_{i,j})$ is obtained
from the transition matrix $P(Z_{t'} | Z_{t'-1})$ as the normalized
eigenvector associated with the eigenvalue 1. As $t'$ grows to
infinity (which requires that $L(S) \to \infty$), the expectation of
$\delta(Z_{t'})$ is the expectation of the number of bits generated
by a production rule. By the Cesàro theorem, it also provides the
asymptotic value of the mdl as the sequence length increases.

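Example 9: For the code $\mathcal{C}_4$, the transition matrix
corresponding to the source pdf of Example 8 is

$$
\begin{pmatrix}
0   & 0.7 & 0.7 & 0.7 \\
0.7 & 0   & 0   & 0   \\
0.2 & 0.2 & 0.2 & 0.2 \\
0.1 & 0.1 & 0.1 & 0.1
\end{pmatrix}
$$

where rows and columns are indexed by $r_{1,1}, r_{1,2}, r_{2,1},
r_{3,1}$ and each entry is $P(R_t = \text{row} \mid R_{t+1} =
\text{column})$. This leads to
$P(R_t = r_{i,j}) = \{0.412, 0.288, 0.2, 0.1\}$.
Finally, the mdl of this code is
$mdl(\mathcal{C}_4) = 0.412 \times 0 + 0.288 \times 1 + 0.2 \times 3 +
0.1 \times 3 = 1.188$.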
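The computation of Example 9 can be reproduced with a short script.
The sketch below builds the transition matrix from Eq. (2) and
approximates the eigenvector associated with the eigenvalue 1 by
repeated multiplication (power iteration), which is a substitute for
an explicit eigen-decomposition. The helper names `transition_matrix`
and `stationary`, the number of iterations, and the assumption that
no left part is longer than the bits generated by the previous rule
are choices made for this illustration.

```python
pdf = {"a1": 0.7, "a2": 0.2, "a3": 0.1}  # source pdf of Example 8

# Code C4, stored as {(symbol, l): b}; the insertion order gives the
# rule order r11, r12, r21, r31 used for rows and columns.
C4 = {("a1", "1"): "0", ("a1", "0"): "10", ("a2", ""): "110", ("a3", ""): "111"}


def transition_matrix(rules, pdf):
    """matrix[r][prev] = P(R_t = r | R_{t+1} = prev), following Eq. (2)."""
    keys = list(rules)
    matrix = [
        [pdf[a] if rules[prev].startswith(l) else 0.0 for prev in keys]
        for (a, l) in keys
    ]
    return keys, matrix


def stationary(matrix, iterations=200):
    """Power iteration towards the eigenvector of eigenvalue 1."""
    n = len(matrix)
    pi = [1.0 / n] * n
    for _ in range(iterations):
        pi = [sum(matrix[r][c] * pi[c] for c in range(n)) for r in range(n)]
    return pi


keys, P = transition_matrix(C4, pdf)
pi = stationary(P)
# Expected number of generated bits per rule: sum of pi(r) * delta(r).
mdl = sum(p * (len(C4[key]) - len(key[1])) for p, key in zip(pi, keys))
print([round(p, 3) for p in pi], round(mdl, 3))
```

Since each column of the matrix sums to 1, the iterates remain
probability vectors and no renormalization is needed.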

The mdl obtained in Example 9 (1.188) is much closer to the
entropy (1.157) than the mdl obtained with Huffman codes (1.3). The
expected number of bits required to code the symbol $a_1$ is
less than 0.5 bit. One can also compute the exact mdl of a
VLRS for sequences of finite length. Indeed, the expectation
of the number of termination bit(s) as well as the pdf
$P(R_t = r_{i,j} \mid t = L(S))$ of the last rule can be obtained
from the termination bit choice and from the source pdf. The exact
probability $P(R_t = r_{i,j} \mid t = \tau)$ of having a given rule
for a given symbol clock $\tau$ can then be computed, and
subsequently one can deduce the expectation of the number of bits
generated to encode the symbol $S_\tau$.

V. Lexicographic Code Design 

This section describes a VLRS construction method which
preserves the lexicographic order of the source
alphabet in the bit domain. As a starting point, we assume that
the Huffman code corresponding to the source pdf $\mu$ is already
known. The length of the Huffman codeword associated with
the symbol $a_i$ is denoted $k_i$. Let $k^+ = \max_i k_i$ denote the
length of the longest codeword. First, let us underline that
the union $\bigcup_{i,j} \{b_{i,j}\}$ of all the bit sequences
$b_{i,j}$ will form a Fixed Length Code (FLC) $\mathcal{T}$ of length
$k^+$. $\mathcal{T}$ contains $2^{k^+}$ codewords. These codewords
will be assigned to production rules in the lexicographic order.
Starting with the smallest symbol $a_1$, $2^{k^+ - k_i}$ rules are
defined for each symbol $a_i$. The left parts of these rules are
defined so that the set $\{l_{i,j}\}_{j \in [1..|\mathcal{R}_i|]}$
forms a FLC of length $k^+ - k_i$. If $k_i = k^+$, this FLC
only contains the element $\varepsilon$. The $2^{k^+ - k_i}$ smallest
remaining codewords of $\mathcal{T}$, i.e. those which have not been
assigned to previous symbols, are then assigned to these production
rules so that $\forall j, j'$,
$l_{i,j} \prec l_{i,j'} \Rightarrow b_{i,j} \prec b_{i,j'}$. By
construction, the proposed algorithm leads to a VLRS with the
lexicographic property and with the same compression efficiency as
the code from which it is constructed. In some cases, the set of
production rules generated in previous steps may be simplified.

Example 10: Let us now assume that the source $S$ is memoryless
with pdf $\mu_2 = \{0.2, 0.7, 0.1\}$. Since $a_2$ has the highest
probability, the Huffman code $\mathcal{H}_2 = \{10, 0, 11\}$
corresponding to this pdf is not lexicographic. The Hu-Tucker code
associated with this source is the code $\mathcal{C}_1$ proposed in
Example 1 and its mdl is equal to 1.8.

The VLRS is constructed according to the proposed construction
procedure. For $\mathcal{H}_2$, we have $k_2 = 1$ and
$k_1 = k_3 = k^+ = 2$. Hence $\mathcal{T} = \{00, 01, 10, 11\}$. Since
$k_1 = 2$, only 1 production rule $r_{1,1}$ is assigned to the symbol
$a_1$ and $l_{1,1} = \varepsilon$, which implies
$r_{1,1} : a_1 \to 00$. The symbol $a_2$ is then assigned
two production rules $r_{2,1}$ and $r_{2,2}$ so that
$r_{2,1} : a_2\,0 \to 01$ and $r_{2,2} : a_2\,1 \to 10$. The
construction algorithm finishes with the assignment of rule
$r_{3,1} : a_3 \to 11$ to symbol $a_3$. Finally, we obtain the code
$\mathcal{C}_3$ proposed in Example 3, for which the mdl is
equal to 1.3 together with the lexicographic property.

Fig. 3. Primitive code $\mathcal{C}_1$, its opposite
$\bar{\mathcal{C}}_1$ and the resulting mirror VLRS.
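The construction procedure of this section can be sketched in Python
as follows. The function name `lexicographic_vlrs`, the symbol names
`a1`, `a2`, ... and the dictionary layout `{(symbol, l): b}` are
assumptions made for this illustration. The input is the list of
Huffman codeword lengths $k_i$ given in alphabet order, and the
sketch assumes that these lengths satisfy the Kraft equality, as
Huffman lengths do.

```python
def lexicographic_vlrs(lengths):
    """Build the rules {(symbol, l): b} from the Huffman codeword
    lengths k_i, listed in the order of the source alphabet."""
    k_max = max(lengths)  # k+, length of the longest codeword
    rules = {}
    next_codeword = 0     # index of the smallest unassigned codeword
    for i, k in enumerate(lengths, start=1):
        width = k_max - k  # length of the left parts l for symbol a_i
        for j in range(2 ** width):
            # Left parts enumerate the FLC of length k+ - k_i in
            # increasing order (the void sequence if width is 0).
            l = format(j, "0{}b".format(width)) if width else ""
            b = format(next_codeword, "0{}b".format(k_max))
            rules[("a{}".format(i), l)] = b
            next_codeword += 1
    if next_codeword != 2 ** k_max:
        raise ValueError("lengths do not satisfy the Kraft equality")
    return rules


# Lengths of the Huffman code H2 = {10, 0, 11} of Example 10.
rules = lexicographic_vlrs([2, 1, 2])
for (symbol, l), b in rules.items():
    print(symbol, l, "->", b)
```

With the lengths of Example 10, the sketch returns the four rules of
code $\mathcal{C}_3$.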

Although the proposed construction yields lexicographic
codes with the same compression efficiency as
Huffman codes, it does not construct, in general, the best
lexicographic VLRS from a compression efficiency point of
view. One may find some lexicographic VLRS with lower mdl.

VI. Mirror Code Design 

The code design described in this section yields
codes with bit marginal probabilities that are asymptotically
equal to 0.5 as the sequence length increases. Let us again
assume, as a starting point, that we know a VLC
$\mathcal{H} = \{b_{1,1}, \ldots, b_{|\mathcal{A}|,1}\}$. Let us now
consider the code
$\bar{\mathcal{H}} = \{\bar{b}_{1,1}, \ldots, \bar{b}_{|\mathcal{A}|,1}\}$
defined so that each bit transition of the codetree characterizing
$\bar{\mathcal{H}}$ is the opposite value of the
corresponding bit transition in $\mathcal{H}$, as depicted in Fig. 3.

The VLRS is obtained by putting together these two codes.
The codes $\mathcal{H}$ and $\bar{\mathcal{H}}$ are respectively used
to define the two sets of $|\mathcal{A}|$ production rules forming
the new VLRS as



    $\mathcal{M} = \{0\,b_{i,1}\}_{i \in [1..|\mathcal{A}|]} \cup \{1\,\bar{b}_{i,1}\}_{i \in [1..|\mathcal{A}|]}$.    (3)



Note that the production rules associated with codes $\mathcal{H}$
and $\bar{\mathcal{H}}$ respectively define the subtrees
corresponding to bit transitions 0 and 1. Note also that the
resulting code, by construction, is a suffix-constrained code.

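Example 11: The construction associated with the code
$\mathcal{C}_1$ leads to the following VLRS:

    $r_{1,1} : a_1\,0 \to 00$
    $r_{2,1} : a_2\,0 \to 010$
    $r_{3,1} : a_3\,1 \to 011$
    $r_{1,2} : a_1\,1 \to 11$
    $r_{2,2} : a_2\,1 \to 101$
    $r_{3,2} : a_3\,0 \to 100$

where the rules $r_{i,1}$ are obtained from
$\mathcal{H} = \mathcal{C}_1$ and the rules $r_{i,2}$ from
$\bar{\mathcal{H}} = \bar{\mathcal{C}}_1$.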
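The rules of Example 11 can be generated programmatically. The sketch
below relies on an assumption inferred from Example 11 and from the
suffix-constrained property, not on an explicit statement of the
text: the left part of each rule is taken to be the last bit of the
bit sequence that the rule generates. The function names `complement`
and `mirror_vlrs` are illustrative.

```python
def complement(bits):
    """Flip every bit of a binary string."""
    return bits.translate(str.maketrans("01", "10"))


def mirror_vlrs(vlc):
    """Build mirror rules {(symbol, l): b} from a VLC {symbol: codeword}.
    Assumption: l is the last bit of the generated sequence b."""
    rules = {}
    for symbol, codeword in vlc.items():
        last = codeword[-1]
        # Rule obtained from the VLC itself: subtree of bit transition 0.
        rules[(symbol, last)] = "0" + codeword
        # Rule obtained from the opposite code: subtree of bit transition 1.
        rules[(symbol, complement(last))] = "1" + complement(codeword)
    return rules


C1 = {"a1": "0", "a2": "10", "a3": "11"}
mirror = mirror_vlrs(C1)
for (symbol, l), b in sorted(mirror.items(), key=lambda item: item[1]):
    print(symbol, l, "->", b)

# Suffix-constrained check of Definition 2: every l is a suffix of b.
suffix_constrained = all(b.endswith(l) for (_, l), b in mirror.items())
print(suffix_constrained)
```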



Proof of $\forall n$, $\lim_{L(S) \to \infty} P(E_n = 0) = 0.5$: Let us
consider a VLRS $\mathcal{M}$ constructed according to the previous
guidelines. The notation $b_{i,j}$ refers to this VLRS (not to the
VLC from which it is constructed). Let $f_t = P(B_t^1 = 0)$ denote
the marginal bit probability associated with the first bit generated
by a given production rule. Since the VLRS is constructed from
a VLC, we have $\forall i, j$, $\delta(r_{i,j}) \geq 1$, which means
that every rule produces at least one bit. The value $f_t$ can be
written as



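$$
\begin{aligned}
f_t &= \sum_{i \in [1..|\mathcal{A}|],\, j \in [1..2]} P(R_t = r_{i,j},\, B_t^1 = 0) && (4) \\
    &= \sum_{i \in [1..|\mathcal{A}|],\, j = 1} P(R_t = r_{i,j}) \\
    &= \sum_{i \in [1..|\mathcal{A}|],\, j = 1} P(S_t = a_i,\, B_{t+1}^1 = l_{i,j}) && (5) \\
    &= \sum_{i \in [1..|\mathcal{A}|],\, j = 1} \left[ P(S_t = a_i,\, l_{i,j} = 0)\, f_{t+1} + P(S_t = a_i,\, l_{i,j} = 1)\,(1 - f_{t+1}) \right]. && (6)
\end{aligned}
$$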



Let $\alpha = \sum_{a_i \in \mathcal{A}} P(a_i,\, l_{i,1} = 0)$. This
quantity corresponds to the sum of the probabilities of the symbols
to which a codeword ending with 0 has been assigned. Note that
$0 < \alpha < 1$. Inserting this quantity in Eqn. 6, we obtain

    $f_t = \alpha f_{t+1} + (1 - \alpha)(1 - f_{t+1})$.    (7)



We can now study the asymptotic behavior of this sequence
as $t' = L(S) - t + 1$ tends to $+\infty$ (note that $f_{L(S)}$ is a
constant). The absolute value of the derivative
$g'(x) = 2\alpha - 1$ of the function
$g(x) = \alpha x + (1 - \alpha)(1 - x)$ is strictly lower than 1 when
$0 < \alpha < 1$. Consequently, the fixed-point theorem applies and
the sequence $f_{L(S)}, f_{L(S)-1}, \ldots$ converges to the solution
of $x = g(x)$, which is 0.5. Subsequently, $\forall i$, opposite
codewords $b_{i,1}$ and $b_{i,2}$ are equiprobable, which concludes
the proof. □
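The convergence stated in the proof can be observed numerically by
iterating Eqn. 7. The sketch below is an illustration under stated
assumptions: it uses the pdf of Example 8 with the primitive code
$\mathcal{C}_1$, and it starts from the arbitrary value $f = 1$; the
variable names are not taken from the paper.

```python
pdf = {"a1": 0.7, "a2": 0.2, "a3": 0.1}   # pdf of Example 8 (illustrative choice)
C1 = {"a1": "0", "a2": "10", "a3": "11"}  # primitive VLC of Example 11

# alpha: total probability of the symbols whose codeword ends with 0.
alpha = sum(p for symbol, p in pdf.items() if C1[symbol].endswith("0"))

f = 1.0  # arbitrary starting value, standing for the constant f_{L(S)}
history = [f]
for _ in range(200):
    # One step of Eqn. 7, moving one symbol backward in the sequence.
    f = alpha * f + (1 - alpha) * (1 - f)
    history.append(f)

print(alpha, history[1], history[-1])
```

The distance to 0.5 is multiplied by $2\alpha - 1$ at each step, so
the closer $\alpha$ is to 0 or 1, the slower the convergence.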

VII. Conclusion and Perspectives 
VLRSs have a low encoding and decoding complexity,
allow for instantaneous decoding, and may have a lower
mdl than Huffman codes. The degrees of freedom that they
offer make it possible to design codes with interesting
properties, as shown in Sections V and VI. We hope that the design
of Section VI will lead to soft decoding results outperforming the
ones obtained with source codes whose marginal bit probability is
not equal to 0.5.